Supervised Machine Learning Examples

Some examples of supervised machine learning in Python. First, load up a ton of modules...


In [34]:
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics
pd.options.mode.chained_assignment = None

Load the data

Next, we have to load the data into a dataframe. In order to have a balanced dataset, we will use 10,000 records from Alexa to represent the not-malicious domains and 10,000 records from gameoverdga to represent the malicious domains.

The value counts below confirm we end up with 10,000 of each.


In [2]:
df = pd.read_csv( '../../data/dga-full.csv' )
#Filter to alexa and gameoverdga
df = df[df['dsrc'].isin(['alexa','gameoverdga'])]
df.dsrc.value_counts()


Out[2]:
alexa          10000
gameoverdga    10000
Name: dsrc, dtype: int64
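
Before adding labels, it can help to peek at the feature columns themselves. This cell is an addition to the original walkthrough; it simply previews the length, dicts, entropy, numbers, and ngram features used later on.

In [ ]:
# Preview the first few rows, including the feature columns
# (length, dicts, entropy, numbers, ngram) used later in this notebook.
df.head()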

Add a Target Column

For our models, we need a numeric column to represent the classes. In our case, we will call the column isMalicious and assign it a value of 0 if the domain is not malicious and 1 if it is.


In [3]:
df['isMalicious'] = df['dsrc'].apply( lambda x: 0 if x == "alexa" else 1 )
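
As a quick sanity check (an extra cell, not part of the original run), the label counts should mirror the dsrc counts above: 10,000 of each class.

In [ ]:
# The label column should be balanced: 10000 zeros and 10000 ones.
df['isMalicious'].value_counts()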

Perform the Train/Test Split

For this, let's create a rather small training data set, as it will reduce the time to train up a model. Feel free to try a 15%, 20%, or even 30% portion for the training data (lower percentages for slower machines).

In this example, we will use 30% of the data for training and 70% for testing.

Normally you would want most of the data in the training set, but more training data can considerably extend the time needed to train up a model.

We're also going to need a list of column names for the feature columns as well as the target column.


In [39]:
train, test = train_test_split(df, test_size = 0.7)
features = ['length', 'dicts', 'entropy','numbers', 'ngram']
target = 'isMalicious'
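
If you want to confirm the split sizes (an added check, not in the original run), a quick print will do it; with test_size=0.7 on 20,000 rows we expect about 6,000 training rows and 14,000 test rows.

In [ ]:
# Confirm the split: roughly 6000 rows for training, 14000 for testing.
print(len(train), len(test))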

Create the Classifiers

The next step is to create the classifiers. What you'll see is that scikit-learn maintains a consistent interface for every machine learning algorithm. For a supervised model, the steps are:

  1. Create the classifier object
  2. Call the .fit() method with the training data set and the target
  3. To make a prediction, call the .predict() method

In [42]:
#Create the Random Forest Classifier
random_forest_clf = RandomForestClassifier(n_estimators=10, 
                             max_depth=None, 
                             min_samples_split=2, 
                             random_state=0)

random_forest_clf = random_forest_clf.fit( train[features], train[target])

In [40]:
#Next, create the SVM classifier
svm_classifier = svm.SVC()
svm_classifier = svm_classifier.fit(train[features], train[target])
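
One design note, offered as an optional refinement rather than part of the original walkthrough: SVMs are sensitive to the scale of their inputs, so standardizing the features before fitting often improves results. A Pipeline keeps the scaler and classifier bundled together so the same scaling is applied at prediction time.

In [ ]:
# Optional refinement (not in the original walkthrough): scale features
# before fitting the SVM. The pipeline applies the same scaling at
# prediction time automatically.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_svm = make_pipeline(StandardScaler(), svm.SVC())
scaled_svm = scaled_svm.fit(train[features], train[target])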

Comparing the Classifiers

Now that we have two different classifiers, let's compare them and see how they perform. Fortunately, scikit-learn has a series of functions to generate metrics for you. The first is the cross-validation score.


In [44]:
scores = cross_val_score(random_forest_clf, train[features], train[target])
scores.mean()


Out[44]:
1.0
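
Since we are comparing the two classifiers, the same score can be computed for the SVM as well (this cell is an addition to the original run):

In [ ]:
# Cross-validation score for the SVM, for a side-by-side comparison.
svm_scores = cross_val_score(svm_classifier, train[features], train[target])
svm_scores.mean()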

We'll need to get the predictions from both classifiers, so we add columns to the test and training sets for the predictions.


In [46]:
test['predictions'] = random_forest_clf.predict( test[features] )
train['predictions'] = random_forest_clf.predict( train[features] )
test['svm-predictions'] = svm_classifier.predict( test[features])
train['svm-predictions'] = svm_classifier.predict( train[features])

In [47]:
test.head()


Out[47]:
              dsrc                      domain  length     dicts   entropy  numbers      ngram  isMalicious  predictions  svm-predictions
21295  gameoverdga  qff3vb1xdu1qfym25w01qhx6r9      26  0.000000  4.074876        9   0.000000            1            1                1
42736        alexa                digitalocean      12  1.000000  3.251629        0  17.745319            0            0                0
46157        alexa           manuscriptcentral      17  1.000000  3.499228        0  21.640531            0            0                0
13538  gameoverdga  beialx1qm4flb1e92284pxlsyc      26  0.384615  4.056021        8   1.301030            1            1                1
18081  gameoverdga   muwh97a5egf01yb07qhxdxtco      25  0.160000  4.323856        7   1.041393            1            1                1

Confusion Matrix

Confusion matrices are a little confusing (yuk yuk), but they are a very valuable tool for evaluating your models. Scikit-learn has a function to generate one, as shown below.


In [48]:
confusion_matrix( test['isMalicious'], test['predictions'])


Out[48]:
array([[6956,    1],
       [   0, 7043]])
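
To make the four cells of the matrix explicit, here is an added snippet that unpacks them; for a 2x2 matrix, .ravel() returns the counts in the order true negatives, false positives, false negatives, true positives.

In [ ]:
# Unpack the 2x2 matrix: tn = benign correctly passed, fp = benign
# flagged as malicious, fn = malicious missed, tp = malicious caught.
tn, fp, fn, tp = confusion_matrix(test['isMalicious'], test['predictions']).ravel()
print(tn, fp, fn, tp)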

The code below generates a nicer presentation of the confusion matrix for the random forest classifier.

From: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py


In [50]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize before drawing so the image, the cell text, and the
    # color threshold all reflect the same values.
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix( test['isMalicious'], test['predictions'])
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Not Malicious', 'Malicious'],
                      title='RF Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Not Malicious', 'Malicious'], normalize=True,
                      title='RF Normalized confusion matrix')

plt.show()


And again for the SVM classifier.


In [49]:
# Compute confusion matrix
svm_cnf_matrix = confusion_matrix( test['isMalicious'], test['svm-predictions'])
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(svm_cnf_matrix, classes=['Not Malicious', 'Malicious'],
                      title='SVM Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(svm_cnf_matrix, classes=['Not Malicious', 'Malicious'], normalize=True,
                      title='SVM Normalized confusion matrix')

plt.show()


Feature Importance

Random forests can calculate an importance score for each feature used in building the forest. These scores are available through the random_forest_clf.feature_importances_ property.


In [52]:
importances = random_forest_clf.feature_importances_

In [21]:
importances


Out[21]:
array([ 0.3 ,  0.08,  0.41,  0.2 ,  0.01])
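
The array is ordered like the features list, so pairing the two makes it much easier to read (an added convenience cell):

In [ ]:
# Pair each feature name with its importance and sort, highest first.
sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)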

You can also visualize this with the following code, adapted from: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html


In [25]:
# Standard deviation of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in random_forest_clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(test[features].shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(test[features].shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(test[features].shape[1]), indices)
plt.xlim([-1, test[features].shape[1]])
plt.show()


Feature ranking:
1. feature 2 (0.410171)
2. feature 0 (0.304949)
3. feature 3 (0.197083)
4. feature 1 (0.082497)
5. feature 4 (0.005300)

You can calculate the accuracy with the metrics.accuracy_score() method, and finally, there is metrics.classification_report(), which calculates the per-class precision, recall, and F1 scores all at once.


In [56]:
pscore = metrics.accuracy_score(test['isMalicious'], test['predictions'])
pscore_train = metrics.accuracy_score(train['isMalicious'], train['predictions'])

In [57]:
print( metrics.classification_report(test['isMalicious'], test['predictions'], target_names=['Not Malicious', 'Malicious'] ) )


               precision    recall  f1-score   support

Not Malicious       1.00      1.00      1.00      6957
    Malicious       1.00      1.00      1.00      7043

  avg / total       1.00      1.00      1.00     14000


In [58]:
svm_pscore = metrics.accuracy_score(test['isMalicious'], test['svm-predictions'])
svm_pscore_train = metrics.accuracy_score(train['isMalicious'], train['svm-predictions'])
print( metrics.classification_report(test['isMalicious'], test['svm-predictions'], target_names=['Not Malicious', 'Malicious'] ) )


               precision    recall  f1-score   support

Not Malicious       1.00      1.00      1.00      6957
    Malicious       1.00      1.00      1.00      7043

  avg / total       1.00      1.00      1.00     14000


In [59]:
print( svm_pscore, svm_pscore_train)


0.999785714286 1.0

In [60]:
print( pscore, pscore_train)


0.999928571429 1.0